Robust quantum computation with d-level quantum systems (qudits) poses two requirements: fast, parallel 
quantum gates and high fidelity two-qudit gates. We first describe how to implement parallel single qudit 
operations. It is by now well known that any single-qudit unitary can be decomposed into a sequence of Givens 
rotations on two-dimensional subspaces of the qudit state space. Using a coupling graph to represent physically 
allowed couplings between pairs of qudit states, we then show that the logical depth of the parallel gate sequence 
is equal to the height of an associated tree. The implementation of a given unitary can then optimize the tradeoff 
between gate time and resources used. These ideas are illustrated for qudits encoded in the ground hyperfine 
states of the atomic alkalies $^{87}$Rb and $^{133}$Cs. Second, we provide a protocol for implementing parallelized 
non-local two-qudit gates using the assistance of entangled qubit pairs. Because the entangled qubits can be prepared 
non-deterministically, this offers the possibility of high fidelity two-qudit gates. 

PACS numbers: 03.67.Lx 



I. INTRODUCTION 

Quantum computation requires the ability to process quantum 
data on a time scale that is short compared to the time scale of 
errors induced by environmental interactions (decoherence). 
Robust computation results when the rate of error in the control 
operations and the rate of decoherence are below some 
threshold independent of the size of the computational register. 
The threshold theorem implies such rates exist, but it 
assumes arbitrary connectivity between subsystems as well as 
the ability to implement the control operations with a high 
degree of parallelism [ s ]. Quantum computer architectures, 
therefore, should be designed to support parallel gate oper- 
ations and measurements. At the software level some work 
has been done regarding parallel computation with qubits. 
For example, certain quantum algorithms such as the quan- 
tum Fourier transform can be parallelized [2], and there are 
techniques to compress the logical depth of a quantum circuit 
on qubits using the commutativity of gates in the Clifford 
group. Further, by using distributed entanglement resources, 
some frequently used control operations can be parallelized [4]. 

This work concerns parallel unitary operations on qudits, 
i.e. d-level systems where typically d > 2. There are several 
reasons for considering such systems. Many physical candidates 
for quantum computation with qubits work by encoding 
in a subspace of a system with many more accessible levels. 
Control over all the levels is important for state preparation, 
simulating quantum processes, and measurement. In particular, 
encoding in decoherence-free subspaces usually involves 
control over multiple distinguishable states. Additionally, for 
small quantum computations, a fixed unitary $U \in U(d)$, for d 
small but larger than 2, can often be implemented with higher 
fidelity in a single qudit rather than by simulation with 
two-qubit gates. Further, at the level of tensor structures, some 
quantum processing may be more efficient with qudits, e.g. 
the Fourier transform over an abelian group whose order is 
not divisible by two [5]. It is straightforward to show that 
naive qubit emulation of qudits is inefficient [6]. 

Fast single-qudit gate times are important in order to imple- 
ment quantum error correction before errors accumulate [7]. 
In Section II we derive parallel implementations of general 
one-qudit unitary gates, where the quantum one-qudit gate li- 
brary is restricted to a small set of couplings between two- 
dimensional subspaces (Givens rotations). The choice of this 
Givens library of one-qudit gates reflects standard coupling 
diagrams, i.e. the particular rotations obey selection rules in 
the physical system that encodes the qudit. Prior work considered 
minimum-gate circuits for such generalized coupling 
diagrams but did not further optimize these circuits in terms 
of depth [8]. Parallelism is possible because quantum gates 
on disjoint subspaces can be applied simultaneously, at the 
expense of additional control resources. Our method is particularly 
helpful for experimental implementations because it can 
be applied to a large class of systems with different allowed 
physical couplings. We provide examples for qudit control 
with ground electronic hyperfine levels of $^{87}$Rb and $^{133}$Cs and 
show that a substantial speed-up is achievable in 
these systems using three pairs of control fields. 

Further, in Section III we obtain depth-optimized (paral- 
lel) implementations of non-local two-qudit gates. Specifi- 
cally, we describe how these operations, which generically re- 
quire 0{(fi) elementary two-qudit gates, can be parallelized 
to depth 0{cf') using 0{(fi) maximally entangled qubit pairs 
(e-bits). While the protocol is not optimized in terms of e-bits 
consumed, it is a step forward to the goal of high fidelity two- 
qudit gates. The qubit resources can be chosen to be ancillary 
degrees of freedom of the particle encoding the qudit. Thus 
they can be prepared in entangled pairs non-deterministically 
and purified before the non-local gate is implemented. 

A third aspect of parallelism [ ] involves reducing the log- 
ical depth of a circuit by judicious grouping of single- and 
two-particle gates that can be performed at the same time step, 
assuming connectivity of the particles. This is roughly analo- 
gous to classic circuit layouts and will not be considered here. 



II. PARALLELISM IN STATE SYNTHESIS AND UNITARY 
TRANSFORMATION FOR A SINGLE QUDIT 

In typical physical systems encoding a single qudit, arbitrary 
couplings are not allowed. Whereas we can represent 
any unitary $U \in U(d)$ as an operator generated from an appropriate 
set of Hamiltonians, viz. $U = e^{i\sum_j t_j h_j}$ where $t_j \in \mathbb{R}$ 
and $\sqrt{-1}\,h_j \in u(d)$ with $h_j = h_j^\dagger$, it is generally not possible to 
turn on all the couplings $h_j$ at the same time. It is a problem of 
quantum control to determine how to simulate a single-qudit 
unitary using a sequence of available couplings. 

Because quantum computations need only be simulated up 
to a global phase, we restrict ourselves to implementations of 
a generic unitary $U \in SU(d)$. One way to implement U is by 
a covering with gates generated by the $su(2)$ subalgebras $g_{jk}$ 
acting on the subspaces spanned by the state pairs $\{|j\rangle, |k\rangle\}$. 

This is realized by a QR decomposition of the inverse unitary 
into a product of unitary (Givens) rotation matrices that reduce 
it to diagonal form $D'$: 



D' 



d(d-l)/2 

n ^Jiki 



(2) 



Here, each Givens rotation can be chosen to be a function of 
two real parameters only: 



(3) 



Typically, parameters are chosen so that consecutive Givens 
rotations introduce an additional zero below the diagonal of 
the unitary. Thus a sequence of such rotations realizes the 
inverse unitary up to relative phases, and the reversed sequence 
of inverse rotations realizes the unitary itself (up to a diagonal 
gate). There are $d(d-1)/2$ elements below the diagonal; 
hence the gate count in Eq. (2). Using an Euler decomposition 
of $SU(2)$, the diagonal gate can be built using 
$3(d-1)$ Givens rotations. 
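The counting argument above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes one common two-parameter form for the rotation, with block $[[\cos\gamma,\ e^{i\phi}\sin\gamma],[-e^{-i\phi}\sin\gamma,\ \cos\gamma]]$ on levels j and k, which may differ from Eq. (3) by phase conventions. It also ignores any coupling-graph restriction and always rotates a row against the row directly above it. The helper names `givens`, `matmul` and `reduce_to_diagonal` are hypothetical.

```python
import cmath
import math

def givens(d, j, k, gamma, phi):
    """d x d identity with a two-parameter rotation embedded on levels j, k.
    The parametrization is an assumption made for this sketch."""
    G = [[1.0 + 0j if r == c else 0j for c in range(d)] for r in range(d)]
    G[j][j] = math.cos(gamma)
    G[k][k] = math.cos(gamma)
    G[j][k] = cmath.exp(1j * phi) * math.sin(gamma)
    G[k][j] = -cmath.exp(-1j * phi) * math.sin(gamma)
    return G

def matmul(A, B):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[r][m] * B[m][c] for m in range(n)) for c in range(n)]
            for r in range(n)]

def reduce_to_diagonal(U):
    """Zero every entry below the diagonal, one rotation per entry.
    Returns the reduced matrix and the list of (j, k, gamma, phi) used."""
    d = len(U)
    A = [row[:] for row in U]
    rotations = []
    for col in range(d - 1):
        for k in range(d - 1, col, -1):    # bottom entry of the column first
            j = k - 1                      # rotate against the row above
            a, b = A[j][col], A[k][col]
            gamma = math.atan2(abs(b), abs(a))
            phi = cmath.phase(a) - cmath.phase(b)
            A = matmul(givens(d, j, k, gamma, phi), A)   # zeroes A[k][col]
            rotations.append((j, k, gamma, phi))
    return A, rotations

# Example input: the 4 x 4 discrete Fourier transform, which is unitary.
d = 4
U = [[cmath.exp(2j * math.pi * r * c / d) / math.sqrt(d) for c in range(d)]
     for r in range(d)]
D, rotations = reduce_to_diagonal(U)
```

Because the input is unitary, the upper-triangular result is diagonal with unit-modulus entries, and the number of rotations equals $d(d-1)/2$.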

A second way to synthesize a unitary transformation is to 
use a spectral decomposition 

$$U = \prod_{\ell=0}^{d-1} W_\ell C_\ell W_\ell^\dagger \qquad (4)$$

where $W_\ell$ is a unitary matrix that maps the basis state $|\ell\rangle$ to 
the eigenvector corresponding to the $\ell$th eigenvalue of U, and 
$C_\ell$ is the identity matrix with its $(\ell,\ell)$ element replaced by the 
$\ell$th eigenvalue. Each matrix $W_\ell$ implements a state-synthesis 
operation and can be implemented as a product of $G_{jk}(\gamma,\phi)$. 
The first major topic of this work is parallelism, both in state 
synthesis and in the two unitary constructions above. 

Particular physical systems exhibit symmetries that con- 
strain and refine the broad picture of unitary evolution pre- 
sented so far [9, 10]. This work focuses on systems in which a 
limited number of pairs of states can be coupled at any given 
time. The exemplar system is a qudit encoded in the ground 
hyperfine state of a neutral alkali atom, where the number of 
pairs that may be coupled at once is determined by the number 
of lasers incident on the atoms. Other candidate systems for 
quantum computation, such as flux based Josephson junction 
qudits and electronic states of trapped ions, may allow this 
type of control. 

We recall how selection rules on an atom with hyperfine 
electron structure constrain the allowed Givens evolutions of 
the system [8, 11]. A pair of Raman pulses can couple states 
$|F_\uparrow, M_F\rangle \leftrightarrow |F_\downarrow, M_F'\rangle$. In the linear Zeeman regime, a specific 
pair of hyperfine states can be addressed by choosing the 
appropriate frequency and polarization of the two Raman beams. 
The coupling acts on the electron degree of freedom, which 
imposes a selection rule $\Delta M_F = M_F - M_F' = \pm 2, \pm 1, 0$. To 
demonstrate the power of our unitary synthesis technique, we 
restrict discussion to the selection rule $\Delta M_F = M_F - M_F' = \pm 1, 0$. 
This restriction is valid when the detuning of each Raman 
laser beam from the excited state is much larger than the 
hyperfine splitting in the excited state ($\Delta \gg E_{\mathrm{ehf}}$) [12]. There 
is a practical advantage to restricting discussion to this 
selection rule. Spontaneous emission during the Raman gate scales 
as $\gamma \sim \Gamma |\Omega_1 \Omega_2| / \Delta^2$, where $\Omega_{1,2}$ are the Rabi frequencies of 
the two Raman beams and $\Gamma$ is the spontaneous emission rate 
from the excited state. Working in the limit of large detunings 
reduces errors due to spontaneous scattering events. 

The hyperfine levels for a d = 8 qudit and the induced 
coupling graph are shown in Figs. 1 and 2. We assume that the 
amplitude and phase of the Raman beams can be controlled so 
that each Givens rotation $G_{jk}(\gamma, \phi)$ can be generated in a 
single time step (see [8]). It is notable that while the multitude of 
hyperfine levels in atomic systems provides a large state space 
for quantum information processing, these states are sensitive 
to errors. For instance, it is possible to choose disjoint 
two-dimensional subspaces, spanned by $\{|F_\downarrow, M_F\rangle, |F_\uparrow, -M_F\rangle\}$, 
that are insensitive to small magnetic field fluctuations along 
the quantization axis. Fluctuating fields along different axes 
have negligible effect provided a large enough fixed Zeeman 
field is applied. There are no such error avoidance codes when 
using the entire set of hyperfine levels. Hence parallelism, on a scale that 
can support error correction on a time scale fast compared to 
environmental noise, will be crucial. 

FIG. 1: A single d = 8 qudit encoded in the ground state hyperfine 
levels of $^{87}$Rb. A pair of lasers can couple states in different 
hyperfine manifolds according to the selection rule $\Delta M_F = 0, \pm 1$. 

FIG. 2: Coupling graph for $^{87}$Rb. 



A. Achieving parallelism in state synthesis 

To implement the unitary state synthesis operator $W_\ell$, we 
construct a sequence of rotations taking a particular vector to 
a given state $|\ell\rangle$. Again, this is technically the reverse of state 
synthesis: $W_\ell|\ell\rangle = |\psi\rangle$ for a generic pure state $|\psi\rangle$ inverts to 
a sequence of unitaries $G_{jk}(\gamma,\phi)$ accomplishing $W_\ell^\dagger|\psi\rangle = |\ell\rangle$. 

Thus in the application $|\ell\rangle$ will be the fiducial state, and we 
attempt to treat all possibilities. We abbreviate the rotation of 
Eq. (3) by $G_{jk}$. 

One tool for identifying sequences of rotations that produce 
$W_\ell^\dagger$ is the rotation or coupling graph, in which node j is 
connected to node k if a rotation between rows j and k is 
physically realizable [13]. Then $W_\ell^\dagger$ is constructed by the sequence 
of rotations determined by constructing a spanning tree rooted 
at $\ell$ and successively eliminating leaf nodes by a rotation with 
their parent. Recall, a spanning tree of a graph $G(V,E)$ 
connects all $d = |V|$ nodes of G with exactly $d-1$ edges from the 
set E. 

Consider, for example, the coupling graph of Figure 2. To 
perform state synthesis for $|0\rangle$, we can form a spanning tree 
by breaking the edge between 1 and 5, breaking one of the 
edges in the cycle 0, 5, 2, 4, 1, 6, 0, and choosing the root to be 
$|0\rangle$. If we break the edge between 2 and 4, then the resulting 
tree has three leaves, 7 (eliminated by $G_{07}$), 3 (eliminated by 
$G_{23}$), and 4 (eliminated by $G_{14}$). We can then eliminate the 
two resulting leaves 1 and 2, and then 6 and 5. Therefore, we 
have constructed a rotation sequence 

$G_{05}G_{06}G_{61}G_{52}G_{14}G_{23}G_{07}$ 

that synthesizes $|0\rangle$ in 7 steps. 
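The leaf-elimination procedure can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the edge list `RB_EDGES` is read off from the description of Figure 2 in the text (the cycle 0, 5, 2, 4, 1, 6, 0, the edge between 1 and 5, and the pendant nodes 7 and 3), and the spanning tree is built by breadth-first search, so it need not coincide with the tree chosen in the example. The function name `elimination_sequence` is hypothetical.

```python
from collections import deque

# Coupling graph as inferred from the text's description of Figure 2
# (an assumption of this sketch, since the figure itself is not reproduced).
RB_EDGES = [(0, 5), (5, 2), (2, 4), (4, 1), (1, 6), (6, 0),
            (1, 5), (0, 7), (2, 3)]

def elimination_sequence(edges, root):
    """Return rotations (parent, child) in the order they are applied.

    A breadth-first spanning tree is grown from the root; visiting the
    nodes in reverse BFS order guarantees that every node is eliminated
    only after all of its descendants have been eliminated.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    parent = {root: None}
    order = [root]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency[node]):
            if neighbor not in parent:
                parent[neighbor] = node
                order.append(neighbor)
                queue.append(neighbor)
    return [(parent[n], n) for n in reversed(order) if n != root]

sequence = elimination_sequence(RB_EDGES, root=0)
```

For a connected graph on d nodes the sequence always has d - 1 rotations, one per non-root node, which is the sequential step count quoted above for d = 8.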

To understand the potential for parallelism, note that some 
of these rotations commute and can therefore be applied in 
parallel. This is a special case of the assertion that infinitesimal 
unitaries $ih_1, ih_2 \in u(d)$ may be applied in parallel iff 
$[h_1,h_2] = 0$ iff $e^{ih_1 t}$ and $e^{ih_2 t}$ commute for all t real. We rely 
on the following result. 

Proposition II.1 A subsequence of p rotations can be applied 
in parallel if and only if all 2p indices are distinct. 

Proof: It is easy to verify that if all four indices are distinct, 
then $G_{jk}G_{nm} = G_{nm}G_{jk}$. Conversely, if the four indices are not 
distinct, then the order of application matters and therefore the 
rotations cannot be applied in parallel. The result follows by 
induction on p. □ 
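A quick numerical check of the two-rotation case of Proposition II.1 is sketched below. The rotation block used here, $[[\cos\gamma,\ e^{i\phi}\sin\gamma],[-e^{-i\phi}\sin\gamma,\ \cos\gamma]]$, is an assumed two-parameter convention that may differ from Eq. (3) by phases; the helper names and the specific angles are hypothetical choices for illustration.

```python
import cmath
import math

def givens(d, j, k, gamma, phi):
    """d x d identity with an assumed two-parameter rotation on levels j, k."""
    G = [[1.0 + 0j if r == c else 0j for c in range(d)] for r in range(d)]
    G[j][j] = math.cos(gamma)
    G[k][k] = math.cos(gamma)
    G[j][k] = cmath.exp(1j * phi) * math.sin(gamma)
    G[k][j] = -cmath.exp(-1j * phi) * math.sin(gamma)
    return G

def matmul(A, B):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[r][m] * B[m][c] for m in range(n)) for c in range(n)]
            for r in range(n)]

def commute(A, B, tol=1e-12):
    """True if AB and BA agree entrywise to within tol."""
    AB, BA = matmul(A, B), matmul(B, A)
    n = len(A)
    return all(abs(AB[r][c] - BA[r][c]) < tol
               for r in range(n) for c in range(n))

d = 8
# Four distinct indices: the rotations act on disjoint subspaces.
disjoint = commute(givens(d, 0, 5, 0.3, 0.7), givens(d, 2, 3, 0.5, 1.1))
# Shared index 0: the order of application matters.
overlapping = commute(givens(d, 0, 5, 0.3, 0.7), givens(d, 0, 6, 0.5, 1.1))
```

With generic angles, `disjoint` is true and `overlapping` is false, in line with the proposition.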

Using square brackets to group rotations that can be applied 
in parallel, the 7-step rotation sequence of our example 
becomes the 4-step parallel rotation sequence 

$G_{05}G_{06}[G_{61}G_{52}][G_{14}G_{23}G_{07}]$ . (5) 

The next interesting question is how we might determine an 
ordering of rotations to produce a parallel rotation sequence 
with a small number of steps. To answer this question, we 
build upon an algorithm of He and Yesha [14, Sec. 3.1]. Given 
a spanning tree, they create a binary computation tree (BCT) 
by working from the bottom up and replacing every internal 
node in the spanning tree by a leaf connected to a chain of p 
nodes, where p is the number of children of the node. They 
then attach one child to each of the new nodes. The final result 
is a binary tree. (This process is illustrated in Figure 3 for a 
spanning tree of the coupling graph in Figure 2 rooted at node 
3.) The following proposition shows that the number of steps 
in our parallel rotation sequence is equal to the height of the 
BCT, not the height of the spanning tree. 

Proposition II.2 An ordering of the rotations can be obtained 
by constructing the BCT for a spanning tree of the coupling 
graph and scheduling each rotation at time step $k - j$, where k 
is the height of the BCT and j is the distance of the two leaves 
of the rotation from the root of the BCT. The resulting number 
of steps is $k - 1$. 

Proof: In constructing the BCT, we have split each node of 
the spanning tree that is involved in more than one rotation 
into a chain of nodes, each on a distinct level. This assures 
that rotations on the same level commute and therefore can be 
applied in parallel. □ 

The resulting ordering is within a factor of $O(\log_2 m)$ of 
optimal, where m is the number of rotations [14]. We next 
present a direct (in fact greedy) algorithm which also orders 
the rotations for optimal parallelism. 
FIG. 3: A spanning tree (left) and a BCT (right) for node 3 of $^{87}$Rb. 

TABLE I: Parallel rotation sequences for state synthesis using laser 
Raman coupled connections between hyperfine states of $^{87}$Rb. 

State          Parallel rotation sequence 
$|0\rangle$    $G_{05}\,[G_{06}G_{52}]\,[G_{61}G_{07}]\,[G_{14}G_{23}]$ 
$|1\rangle$    $G_{16}\,[G_{60}G_{14}]\,[G_{05}G_{42}]\,[G_{23}G_{07}]$ 
$|2\rangle$    $G_{25}\,[G_{24}G_{50}]\,[G_{41}G_{23}]\,[G_{16}G_{07}]$ 
$|3\rangle$    $G_{32}\,G_{24}\,G_{25}\,[G_{50}G_{41}]\,[G_{16}G_{07}]$ 
$|4\rangle$    $G_{41}\,[G_{16}G_{42}]\,[G_{25}G_{60}]\,[G_{23}G_{07}]$ 
$|5\rangle$    $G_{50}\,[G_{52}G_{07}]\,[G_{06}G_{24}]\,[G_{61}G_{23}]$ 
$|6\rangle$    $G_{61}\,[G_{14}G_{60}]\,[G_{05}G_{42}]\,[G_{23}G_{07}]$ 
$|7\rangle$    $G_{70}\,G_{06}\,[G_{61}G_{05}]\,[G_{14}G_{52}]\,G_{23}$ 

FIG. 4: Coupling graph for $^{133}$Cs. 



At each step, consider each leaf of the spanning tree in order 
of its distance from the root (more distant leaves first), and 
process (remove) any leaf whose rotation can be applied in 
parallel with those already chosen for processing. The two 
algorithms give the same number of steps but perhaps assign 
a different timing to some rotations. For example, the greedy 
algorithm applied to the spanning tree on the left of Figure 3 
yields 

$G_{32}G_{24}G_{25}[G_{50}G_{41}][G_{07}G_{16}]$ , 

while the BCT on the right of the figure yields the schedule 

$G_{32}G_{24}[G_{25}G_{41}][G_{50}G_{16}]G_{07}$ . 

Both rotation sequences require 5 steps. 
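The greedy rule can be written down directly. The sketch below is one possible reading of it, not the authors' code: a spanning tree is given as a child-to-parent map, ties between equally distant leaves are broken by node label (an arbitrary choice), and no limit is placed on the number of simultaneous rotations. The two trees are transcribed from the rotation sequences in the text; `greedy_schedule`, `TREE_ROOT3` and `TREE_ROOT0` are hypothetical names.

```python
def greedy_schedule(parent, root):
    """Group the leaf-elimination rotations of a rooted tree into parallel steps.

    `parent` maps every non-root node to its parent. Each step takes the
    current leaves, most distant from the root first, and schedules a
    leaf's rotation (parent, leaf) unless one of its two indices is
    already used in that step.
    """
    depth = {root: 0}

    def node_depth(n):
        if n not in depth:
            depth[n] = node_depth(parent[n]) + 1
        return depth[n]

    remaining = set(parent)
    for n in remaining:
        node_depth(n)

    steps = []
    while remaining:
        has_child = {parent[n] for n in remaining}
        leaves = sorted((n for n in remaining if n not in has_child),
                        key=lambda n: (-depth[n], -n))
        busy, step = set(), []
        for leaf in leaves:
            p = parent[leaf]
            if leaf not in busy and p not in busy:
                step.append((p, leaf))
                busy.update((p, leaf))
        remaining -= {leaf for _, leaf in step}
        steps.append(step)
    return steps

# Spanning tree rooted at 3, read off from G32 G24 G25 [G50 G41] [G07 G16].
TREE_ROOT3 = {2: 3, 4: 2, 5: 2, 0: 5, 1: 4, 7: 0, 6: 1}
# Spanning tree rooted at 0, read off from G05 G06 [G61 G52] [G14 G23 G07].
TREE_ROOT0 = {5: 0, 6: 0, 7: 0, 2: 5, 3: 2, 1: 6, 4: 1}

steps_root3 = greedy_schedule(TREE_ROOT3, 3)
steps_root0 = greedy_schedule(TREE_ROOT0, 0)
```

The first list in each result is the step applied first, so it corresponds to the rightmost bracket in the notation of the text. Node 2 has three edges in the tree rooted at 3, which is why that schedule cannot be shorter than the chain of rotations through node 2.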

Therefore, we can determine an ordering for the rotations 
to perform state synthesis for $|\ell\rangle$ by considering in turn each 
possible spanning tree rooted at $\ell$, constructing an ordering 
for it, and choosing the ordering that provides the smallest 
number of steps. 

It is possible that resource constraints prevent us from 
implementing a parallel ordering. Suppose for example a limited 
number of laser beams allows us to apply only two rotations 
at a time. State synthesis for $|0\rangle$ (Eq. 5) can still be 
accomplished using a 4-step rotation sequence, but it requires a 
non-trivial rearrangement: 

$G_{05}[G_{06}G_{52}][G_{61}G_{07}][G_{14}G_{23}]$ . (6) 

In general, such a constrained scheduling problem is difficult 
to solve exactly, although good heuristics exist. 

B. Examples of parallelism in state synthesis 

We apply our state synthesis algorithms to rubidium and 
cesium. 

a. Hyperfine levels of $^{87}$Rb. Only the 9 transitions 
corresponding to the edges of the coupling graph of Figure 2 are 
allowed, and the edge between 1 and 5 will not be used in our 
algorithms, since it does not lead to speed-up [15]. 

Optimal parallel rotation sequences, constructed using 
Proposition II.2, are given in Table I. They require 5 steps 
for $|3\rangle$ and $|7\rangle$ and 4 steps for the other kets, rather than the 7 
steps of the sequential algorithm. 

b. Hyperfine levels of $^{133}$Cs. The coupling graph of 
allowed transitions for $^{133}$Cs is given in Figure 4. We partition 
these transitions into three groups: 

• The outer chain of (red) transitions between $|15\rangle$, $|0\rangle$, 
$|13\rangle$, $|2\rangle$, $|11\rangle$, $|4\rangle$, $|9\rangle$, $|6\rangle$, and $|7\rangle$. 

• The inner chain of (blue) transitions between $|14\rangle$, $|1\rangle$, 
$|12\rangle$, $|3\rangle$, $|10\rangle$, $|5\rangle$, and $|8\rangle$. 

• A ladder of transitions between the two chains. 

Since d = 16, state synthesis requires 15 rotations. If the 
desired state is $|3\rangle$, for example, then we can use the outer 
chain of transitions to depopulate $|7\rangle$, $|6\rangle$, $|9\rangle$, $|4\rangle$ (in order) 
and then $|15\rangle$, $|0\rangle$, $|13\rangle$, $|2\rangle$, and then use the ladder transition 
from $|11\rangle$ to $|3\rangle$. Similarly, the inner chain of transitions can 
be used to empty $|14\rangle$, $|1\rangle$, $|12\rangle$, $|8\rangle$, $|5\rangle$, and finally $|10\rangle$. This 
pattern of using the outer chain, the inner chain, and a single 
ladder transition accomplishes state synthesis for an arbitrary 
state. 

Complete parallelism is possible in the application of 
rotations from the outer chain with those in the inner, since no 
state is involved in both chains. If two rotations can be applied 
at once, then we need 9 steps for state synthesis to $|15\rangle$ or $|7\rangle$ 
and 8 steps for the other kets. We illustrate such a scheme 
in Figures 5 and 6, marking each transition with the step at 
which it is used. 

FIG. 5: State synthesis for $|1\rangle$ for the Cesium alkali using two-way 
parallelism. All transitions are directed toward $|1\rangle$. 
C. Parallelism in one-qudit unitary processes 

Recall that a state synthesis routine yields routines for 
realizing arbitrary one-qudit unitary evolutions in (at least) two 
different ways: by invoking the QR matrix decomposition 
(Eq. 2) or by the spectral theorem (Eq. 4). The number of 
parallel steps for a generic unitary can be significantly greater 
when using the spectral theorem. For example, for $^{87}$Rb, 
the spectral decomposition would take 68 steps plus the steps 
needed to apply the phases. The number of steps to apply 
parallel QR is much less; with 3-way parallelism it is at most 
$2n - 3 = 13$ (n = 8) plus the steps to apply the phases. Also 
note that the sequential QR requires $n(n-1)/2 = 28$ steps, so 
this is a considerable speedup. 

A rotation sequence that achieves this bound of 13 steps 
for QR can be constructed using the precedence graph 
for the computation [16]. Suppose we order the rows as 
7, 5, 0, 6, 1, 4, 2, 3. We usually use rotations that eliminate an 
element in any row by a rotation with the element directly 
above it, but in the first column we use the rotation sequence 

$G_{70}G_{05}G_{06}[G_{52}G_{61}][G_{14}G_{23}]$. 

This sequence specifies predecessors for each rotation in the 
first column. Define the predecessors of a rotation for columns 
after the first to be the rotations zeroing elements to the south, 
west and northwest, if those rotations exist. Each rotation 
can be performed after all of its predecessors are completed. 
Therefore, the numerical value of each entry below the 
diagonal in the following matrix denotes the step at which the entry 
can be zeroed: 

Thus, using 3-way parallelism, an arbitrary unitary can be 
applied in 13 steps, plus the steps for phasing. 

If only 2-way parallelism is allowed, then more steps are 
necessary. We schedule rotations by cycling through the 
columns in round-robin order (right to left), scheduling at 
most one rotation per column, until all rotations are 
scheduled. If the predecessors of the column's next rotation are 
scheduled, then that rotation is scheduled for the earliest 
available time step after their scheduled steps. The resulting time 
steps are (rows labeled in the order 7, 5, 0, 6, 1, 4, 2, 3): 

7 |  x   x   x   x   x   x   x   x 
5 |  4   x   x   x   x   x   x   x 
0 |  6  10   x   x   x   x   x   x 
6 |  3   8  11   x   x   x   x   x 
1 |  2   7   9  12   x   x   x   x 
4 |  1   5   8  11  13   x   x   x 
2 |  2   4   6   9  12  14   x   x 
3 |  1   3   5   7  10  13  15   x 

These 15 steps are optimal for 2-way parallelism; the last two 
rotations must be applied sequentially, so the 28 rotations 
cannot be applied in 14 steps. 

A similar construction using the Cesium coupling graph 
shows that at most 29 steps are required 
using 7-way parallelism. We order the rows as 
15, 14, 0, 13, 1, 12, 2, 11, 3, 10, 4, 9, 5, 8, 6, 7. The rotations used 
in the first column are 

$G_{15,0}G_{0,13}G_{13,2}G_{2,11}G_{11,4}G_{4,9}G_{9,5}G_{9,6}$ 
$[G_{0,14}G_{13,1}G_{2,12}G_{11,3}G_{4,10}G_{5,8}G_{6,7}]$, 

while in other columns we use rotations that eliminate an 
element in any row by a rotation with the element directly above 
it. The time steps are as follows: 

If fewer parallel resources are available, we can again 
reschedule our steps as done above for Rubidium. For 3-way 
parallelism, for example, we can schedule the rotations as 
FIG. 6: State synthesis for $|7\rangle$ for the Cesium alkali using two-way 
parallelism. 

TABLE II: Number of parallel steps to synthesize a generic unitary 
operation U, up to a diagonal gate D, on a single atomic qudit. Each 
Raman pair of laser beams counts as a single resource and the logical 
depth is the number of parallel Raman gate sequences needed in our 
QR diagonalization of U. The number in parentheses is our best lower 
bound. The tradeoff between time and resources is evident. 

Parallelism   Logical depth: $^{87}$Rb (d=8)   $^{133}$Cs (d=16) 
7-way         13 (11)                           29 (26) 
6-way         13 (11)                           30 (26) 
5-way         13 (11)                           31 (26) 
4-way         13 (11)                           35 (26) 
3-way         13 (11)                           42 (42) 
2-way         15 (15)                           62 (61) 
1-way         28 (28)                           120 (120) 

A summary of the tradeoff between resources and gate 
times with qudits encoded in ground hyperfine levels of $^{87}$Rb 
and $^{133}$Cs is given in Table II. As noted above, the 2-way 
construction of 15 steps for $^{87}$Rb is optimal. Similar reasoning 
gives the 2- and 3-way lower bounds for $^{133}$Cs; for example, 
118 rotations divided by 3 gives 40 steps plus two final steps 
for the last two rotations. The other lower bounds in the 
table are obtained assuming a completely connected coupling 
graph and (n/2)-way parallelism. In that case, if $n = 2^p$, we 
can insert n/2 zeros in the first column at step 1, up to n/4 
zeros in the first two columns at step 2, ..., 1 zero in the first 
p columns at step p, and then start the reduction in the jth 
column for $j = p + 1, \ldots, n - 1$ at step $\log_2 n + 2(j - \log_2 n)$, 
for a total of $\log_2 n + 2(n - 1 - \log_2 n)$ steps. Other choices of 
rotation sequences may reduce some entries in the table. 
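The two counting arguments behind the lower bounds can be reproduced with a few lines of arithmetic. This is a sketch of the formulas as stated in the text, with hypothetical function names; the second function encodes the reasoning quoted for the 2- and 3-way entries (all but the last two rotations packed k at a time, then two sequential rotations) and is only checked here against the values the text gives.

```python
import math

def full_connectivity_bound(n):
    """log2(n) + 2(n - 1 - log2(n)) for n a power of two, assuming a
    completely connected coupling graph and (n/2)-way parallelism."""
    p = n.bit_length() - 1          # exact log2 for a power of two
    return p + 2 * (n - 1 - p)

def limited_parallel_bound(n, k):
    """Steps needed if at most k rotations fit in one step and the last
    two of the n(n-1)/2 rotations must be applied one after the other."""
    total = n * (n - 1) // 2
    return math.ceil((total - 2) / k) + 2

rb_full = full_connectivity_bound(8)       # d = 8
cs_full = full_connectivity_bound(16)      # d = 16
rb_two_way = limited_parallel_bound(8, 2)
cs_three_way = limited_parallel_bound(16, 3)
cs_two_way = limited_parallel_bound(16, 2)
```

These evaluate to the parenthesized entries 11, 26, 15, 42 and 61 of Table II, respectively.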



D. Parallel diagonal gates 

Up to this point our discussion has counted the number of 
parallel steps needed to construct any single-qudit unitary up 
to a diagonal gate D. Synthesizing the diagonal gate is 
unnecessary if the target qudit will remain dormant until a 
measurement in the computational basis. However, if the qudit will 
be targeted by subsequent operations then it will be 
necessary to phase the basis states of the qudit appropriately. We 
next consider parallel constructions for D. There are two 
variations of this problem to discuss. In the first, we define a gate 
to be an evolution by the generator $\lambda^z_{j,k}$ where j and k are 
paired levels. In the second, the gate library is restricted to 
Givens rotations (Eq. 3) as is the case in systems controlled 
with Raman laser pairs. Here one cannot realize a diagonal 
Hamiltonian directly but rather may simulate $e^{-i\phi\lambda^z_{j,k}}$ using an 
Euler angle decomposition. 

First, note that the D gate itself need only be simulated 
up to a local phase: e.g., we may choose $D \in SU(d)$. 
Simulating a diagonal gate with $d - 1$ independent phases should 
require appropriate couplings between $d - 1$ pairs of states. 
There is a large amount of freedom in the choice of the 
set of the $d - 1$ state pairs: any $D \in SU(d)$ can be written 
$D = \prod_{m=1}^{d-1} e^{-i\phi_{j_m,k_m}\lambda^z_{j_m,k_m}}$, provided the set of edges 
$E = \{(j_m,k_m)\}$ creates a spanning tree of the coupling graph. For 
$\{i\lambda^z_{j_m,k_m} : (j_m,k_m) \in E\}$ spans the diagonal subalgebra of $su(d)$, 
and therefore we may construct $\{\phi_{j_m,k_m}\}$ by solving $d-1$ 
linear equations [8]. Since diagonal gates commute, the 
simulation (in terms of $\lambda^z_{j,k}$) is maximally parallel, requiring one 
step. If only k-wise parallelism is allowed, then the number of 
steps is $\lceil (n-1)/k \rceil$. 

We next consider the case that only $\lambda^x_{j,k}$ and $\lambda^y_{j,k}$ are 
allowed. Again choose any spanning tree for the coupling 
graph and construct $\{\phi_{j_m,k_m}\}$ by solving $d-1$ linear equations. 
Color the edges of the tree so that no node has two edges of 
the same color. (For example, in Figure 3 we need 3 colors 
because node 2 has 3 edges.) Now for any edge (j,k), we may 
indeed realize $e^{-i\phi\lambda^z_{j,k}} = e^{-it_1\lambda^x_{j,k}} e^{-it_2\lambda^y_{j,k}} e^{-it_3\lambda^x_{j,k}}$ for 
appropriate timings. Evolutions $e^{-it\lambda^x_{j,k}}$ and $e^{-it\lambda^y_{j,k}}$ do not 
commute and may not be applied in the same time step. Yet we 
may group the evolutions for a single color - black, for 
example - in three time steps as 

Given a sufficient number of operations per step, this realizes 
D in 3c parallel steps, where c is the number of colors, 
regardless of the number of levels in the spanning tree. Hence, 
the construction is optimized by choosing a spanning tree that 
minimizes the number of colors. The number of colors c is 
bounded by the maximum valency $c_m$ of any node in the 
coupling graph; if the coupling graph itself is a tree, then the 
number of colors is exactly $c_m$. When control resources are 
limited, we make a similar coloring, but limit the number of 
edges of a given color to the maximum number of operations 
allowed per step. 
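The edge coloring used in this construction is easy to compute for a tree. The sketch below is an illustrative breadth-first coloring, not taken from the paper: each node gives its child edges the smallest colors that differ from the color of the edge to its own parent, so the number of colors equals the maximum valency of the tree. The tree `TREE_ROOT3` is the spanning tree of Figure 3 as read off from the rotation sequences in the text; all names are hypothetical.

```python
from collections import deque

def color_tree_edges(parent, root):
    """Assign a color index to every edge (parent, child) of a rooted tree
    so that no node touches two edges of the same color."""
    children = {}
    for child, p in parent.items():
        children.setdefault(p, []).append(child)
    color = {}
    incoming = {root: None}      # color of the edge leading up to each node
    queue = deque([root])
    while queue:
        node = queue.popleft()
        next_color = 0
        for child in sorted(children.get(node, [])):
            if next_color == incoming[node]:
                next_color += 1  # skip the color already used at this node
            color[(node, child)] = next_color
            incoming[child] = next_color
            next_color += 1
            queue.append(child)
    return color

# Spanning tree rooted at node 3 (child -> parent), inferred from the text.
TREE_ROOT3 = {2: 3, 4: 2, 5: 2, 0: 5, 1: 4, 7: 0, 6: 1}

edge_colors = color_tree_edges(TREE_ROOT3, 3)
num_colors = len(set(edge_colors.values()))
phasing_steps = 3 * num_colors   # three pulses per color, as in the text
```

For this tree the coloring uses three colors, giving the 3c = 9 parallel steps discussed for the unrestricted-resource case.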

The spanning tree of Figure 3 for $^{87}$Rb requires 
three colors for the edges. A diagonal computation 
can be done with the gate sequence D = 

which requires 9 parallel Raman pulse sequences. Similarly, 
Cesium requires 9 parallel Raman pulse sequences. 

The above treatment works for synthesizing an arbitrary 
diagonal gate D without prior processing. However, 
generically, the gate D follows the diagonalization 
process described in Sec. II C. In that case some of the pairwise 
phasing operations can be subsumed in earlier steps, therefore 
reducing the total number of Raman pulse sequences. First, 
since Proposition II.1 can be extended to any unitary, not just 
rotations of the form $G_{jk}$, we can apply a phase correction 
using edge (j,k) as soon as we are finished with those two 
rows in the diagonalization. Second, we are allowed to 
choose an edge set for phasing different than the one we used 
for diagonalization. For example, using 3-way parallelism for 
Rubidium, at times 11, 12, and 13 of the diagonalization, we 
can apply a phase correction using edge (0,6); at times 14, 
15, and 16 we can use (7,0), (6,1), and (5,2); and at times 
17, 18, and 19 we can finish by using (0,5), (1,4), and (2,3). 
A similar idea works for Cesium using 7-way parallelism: 
at times 27, 28, and 29 use edge (0,14); at times 30, 31, 
and 32 use (15,0), (14,1), (13,2), (12,3), (11,4), (10,5), 
and (9,6); and at times 33, 34, and 35 use 
(0,13), (1,12), (2,11), (3,10), (4,9), (5,8), and (6,7). In 
Rubidium, six extra Raman pulse sequences are optimal for 
phasing when nodes $|2\rangle$ and $|3\rangle$ are involved in the last 
rotation, and six pulses ending on $|6\rangle$ and $|7\rangle$ are optimal for 
Cesium. We require no more than three and six simultaneous 
couplings respectively, which is also the number required for 
optimal diagonalization. 



III. PARALLELIZED NON-LOCAL TWO-QUDIT GATES 

In this section we propose an implementation of an arbitrary 
non-local unitary $U \in U(d^2)$ between two qudits A and B. We 
suppose the qudits are spatially separated in some quantum 
computing architecture, yet this architecture has the capability 
to (i) prepare a large reservoir of maximally entangled (d = 2) 
qubits and (ii) shuttle halves of such Bell pairs so 
that they are spatially close to qudits A and B. Hence, part of 
the costing is the number of such Bell pairs (e-bits) consumed 
in the nonlocal gate. To be clear, we describe only a nonlocal 
two-qudit gate rather than a teleported two-qudit gate, meaning 
that quantum operations are performed on two qudits rather 
than four. The optimization of such a nonlocal gate presented 
here arises by considering its component rotations in terms of 
the QR decomposition. 

Before stating the protocol, we argue for why it is needed. 
Two criteria must be satisfied to realize high performance 
two-qudit gates. First, nonlocality itself is desirable; most 
quantum computer architectures impose spatial limitations on 
inter-qudit couplings. It is very inconvenient to simply accept 
this limitation, since fault tolerant computation requires con- 
nectivity [ i ]. Now one might also suggest directly swapping 
qudits in order to achieve the required connectivity. Yet the 
swap gate itself may be faulty, and thus the resources required 
to make swapping fault tolerant might be prohibitive. 

Second, reliable computation requires high fidelity two-qudit
gates. Usually, Hamiltonians capable of entangling
distinct qudits are difficult to engineer (at any fidelity) and
would require effort to optimize for fidelity. Thus, one would
likely choose a particular physically available entangling
two-qudit Hamiltonian, e.g. perhaps the controlled-phase gate
$P_\phi = e^{i\phi |0\rangle\langle 0| \otimes |0\rangle\langle 0|}$, and then exploit this with local unitary
similarity transforms to achieve arbitrary Givens rotations
between qudit levels. The entire process might simulate any
$U \in U(d^2)$ [8]. Local unitary similarity transforms arose
naturally in this discussion, and it further implies that two-qudit
nonlocality in such a scheme would follow, given a nonlocal
protocol for a single entangling Hamiltonian.
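
As a small illustration of the role of local similarity transforms, the following NumPy sketch conjugates the controlled-phase gate by local level swaps so that the phase lands on an arbitrary product state $|j\rangle|k\rangle$. The helper names (`controlled_phase`, `swap_levels`, `moved_phase`) are hypothetical and chosen only for this example; it does not reproduce the full Givens-rotation construction of [8].

```python
import numpy as np

def controlled_phase(d, phi):
    # P_phi = exp(i*phi |0><0| (x) |0><0|): diagonal, phase only on |0>|0>
    diag = np.ones(d * d, dtype=complex)
    diag[0] = np.exp(1j * phi)
    return np.diag(diag)

def swap_levels(d, j):
    # Local permutation exchanging levels |0> and |j> of a single qudit
    X = np.eye(d)
    X[[0, j]] = X[[j, 0]]
    return X

def moved_phase(d, phi, j, k):
    # Local unitary similarity transform S P_phi S^dagger with S = S_A (x) S_B
    S = np.kron(swap_levels(d, j), swap_levels(d, k))
    return S @ controlled_phase(d, phi) @ S.conj().T

if __name__ == "__main__":
    G = moved_phase(3, 0.7, 2, 1)
    # The only non-trivial diagonal entry now sits at index j*d + k = 7
    print(np.round(np.diag(G), 3))
```

Only the single entangling gate `controlled_phase` acts on both qudits; everything else is local, which is why a nonlocal protocol for that one gate suffices in such a scheme.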

It is difficult to design an architecture for two-qudit uni- 
taries which allows for both high-fidelity and high connectiv- 
ity. Some possibilities are noteworthy. As opposed to a chain 
of swapping operations, distant qudits might be swapped us- 
ing entanglement resources. Then a non-local gate between 
qudits A and B can be done by teleporting A to a location 
neighboring B, performing an entangling gate between A and 
B and teleporting back. Typically, entangled qudits (e-dits)
rather than e-bits are used to teleport qudits; i.e. each
teleportation is performed with the assistance of a maximally
entangled two-qudit resource $|\Phi_d^+\rangle = \frac{1}{\sqrt{d}} \sum_{k=0}^{d-1} |k\rangle|k\rangle$. While
the amount of entanglement consumed using the resource
$|\Phi_d^+\rangle$ is low, i.e. one e-dit $= \log_2(d)$ e-bits, such a protocol
would still require high fidelity (local) two-qudit gates be- 
tween A and B. As hinted at in the first paragraph of this 
section, a second alternative is to teleport the gate itself us- 
ing an adaptation of the two-qubit gate teleportation proto- 
col [19, 20] [21, §2]. In such an implementation one would 
build a generic two qudit gate between A and B using mul- 
tiple applications of a gate teleport sequence where each se- 
quence consumed two e-dits. Such a protocol would require 
the preparation of high fidelity e-dits and the implementation 
of generalized two-qudit Bell-measurements between a mem- 
ory qudit and one half of an e-dit. 

Here we describe a simple protocol for implementing a non- 
local two qudit gate, which has the advantage that one need 
only prepare high fidelity e-bits. If several qubits can be
controlled together, the entire non-local gate can be parallelized
to reduce the overall implementation time by a factor of $O(d)$.

A. A non-local controlled unitary gate 

Consider a one-qudit unitary gate controlled on dit $(d-1)$:

$$\wedge_1(V) = \sum_{j=0}^{d-2} |j\rangle\langle j| \otimes \mathbf{1}_d + |d-1\rangle\langle d-1| \otimes V .$$

We label the control qudit A and the target qudit B. This
subsection describes how such a gate can be implemented using

1. operators local to A and B

2. an e-bit. The ancillary e-bit is encoded in a pair of
qubits, say $A_1$ and $B_1$, again with $A_1$ neighboring A and
$B_1$ neighboring B. The joint state of the ancilla is the
Bell pair $|\Phi^+\rangle = (1/\sqrt{2})(|00\rangle + |11\rangle)_{A_1,B_1}$.

3. a controlled-not gate controlled on the qudit and
targeting an ancillary qubit. As a formula, this gate is

$\wedge_1(\sigma^x) = \sum_{j=0}^{d-2} |j\rangle\langle j| \otimes \mathbf{1}_2 + |d-1\rangle\langle d-1| \otimes \sigma^x$.

4. a spatially local controlled-$V$ gate with an ancilla
bit as control. As a formula, this is (also, confusingly)

$\wedge_1(V) = |0\rangle\langle 0| \otimes \mathbf{1}_d + |1\rangle\langle 1| \otimes V$.

The controlled gate of item 3 should be considered to be a 
primitive, highly engineered as discussed in the previous sec- 
tion. The controlled gate of item 4 might be decomposed into 
local gates and the gate of item 3 using standard techniques 
[8, 9, 10]. 

The procedure for realizing $\wedge_1(V)$ is as follows.

• Apply $\wedge_1(\sigma^x)$ with A as control and $A_1$ as target.

• Measure $(\mathbf{1}_2 + \sigma^z_{A_1})/2$. Send the one bit (c-bit) classical
measurement result, $m_1$, to the side of qudit B.

• Perform the $\sigma^x_{B_1}$ correction conditioned on $m_1$ on the B side
of the architecture.

• Apply the operation $|0\rangle\langle 0| \otimes \mathbf{1}_d + |1\rangle\langle 1| \otimes V$ with $B_1$ as
control and B as target.

• Measure $(\mathbf{1}_2 + \sigma^x_{B_1})/2$ and send the c-bit measurement
result $m_2$ to A.

• Apply a relative phase to state $d-1$ of A iff $m_2 = 1$,
i.e. apply $e^{i\pi m_2 |d-1\rangle_A\,{}_A\langle d-1|}$.
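
The six steps above can be checked numerically. The following NumPy sketch post-selects each pair of measurement outcomes and confirms that the state of A and B always ends up as $\wedge_1(V)|\psi\rangle$. Several conventions are assumptions made for this illustration only: $m_1$ is the z-basis outcome index of $A_1$, the $\sigma^x$ correction on $B_1$ is applied when $m_1 = 1$, and $m_2 = 0$ ($m_2 = 1$) labels the $|+\rangle$ ($|-\rangle$) outcome on $B_1$. The function names are hypothetical.

```python
import numpy as np

def random_unitary(n, rng):
    # QR factorization of a complex Gaussian matrix yields a unitary Q
    z = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    q, _ = np.linalg.qr(z)
    return q

def controlled_v(d, V):
    # wedge_1(V): apply V to B only when A is in level |d-1>
    P_last = np.zeros((d, d))
    P_last[d - 1, d - 1] = 1.0
    return np.kron(np.eye(d) - P_last, np.eye(d)) + np.kron(P_last, V)

def nonlocal_controlled_v(psi_ab, V, m1, m2):
    """Run the protocol for fixed outcomes (m1, m2); return normalized A,B state."""
    d = V.shape[0]
    bell = np.eye(2) / np.sqrt(2)                  # (|00> + |11>)/sqrt(2), indices [a1, b1]
    # Joint state tensor with indices [a, a1, b1, b]
    state = np.einsum('ab,xy->axyb', psi_ab.reshape(d, d), bell).astype(complex)
    # Step 1: flip A1 when A is in level |d-1>
    state[d - 1] = state[d - 1][::-1].copy()
    # Step 2: z-basis measurement of A1 with outcome m1; indices become [a, b1, b]
    state = state[:, m1, :, :]
    # Step 3: classically controlled sigma^x on B1
    if m1 == 1:
        state = state[:, ::-1, :]
    # Step 4: controlled-V with B1 as control and B as target
    state = np.stack([state[:, 0, :], state[:, 1, :] @ V.T], axis=1)
    # Step 5: Hadamard-basis measurement of B1 (m2 = 0 -> |+>, m2 = 1 -> |->)
    sign = 1 - 2 * m2
    state = (state[:, 0, :] + sign * state[:, 1, :]) / np.sqrt(2)
    # Step 6: phase exp(i*pi*m2) on level |d-1> of A
    state = state.copy()
    state[d - 1] *= (-1) ** m2
    out = state.reshape(d * d)
    return out / np.linalg.norm(out)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 3
    V = random_unitary(d, rng)
    psi = rng.normal(size=d * d) + 1j * rng.normal(size=d * d)
    psi /= np.linalg.norm(psi)
    target = controlled_v(d, V) @ psi
    for m1 in (0, 1):
        for m2 in (0, 1):
            ok = np.allclose(nonlocal_controlled_v(psi, V, m1, m2), target)
            print(m1, m2, ok)
```

Post-selection is used instead of random sampling so that every outcome branch is exercised deterministically; in each branch one e-bit is consumed and two c-bits are exchanged.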



B. Bootstrap to nonlocal two-qudit state synthesis 

We next consider the question of building a nonlocal two-qudit
state synthesis operator. We may write any two-qudit
state $|\psi\rangle = \sum_{j=0}^{d-1} |j\rangle \otimes |\psi_j\rangle$, where the kets $|\psi_j\rangle$ are
unnormalized. We also take the convention that $W|\psi\rangle = |0\rangle$ so that
$W^\dagger|0\rangle = |\psi\rangle$. Using the partition of the state vector, one may
show that any two-qudit state-synthesis operator $W$ can be
decomposed into $d-1$ elementary controlled-rotation operators
as follows [6]:

$$W = \Big[\prod_{j=0}^{d-2} (F_{d-1-j} \otimes \mathbf{1}_d)\, \wedge_1(V_{d-1-j})\, (F_{d-1-j} \otimes \mathbf{1}_d)\Big] (\mathbf{1}_d \otimes V_0). \tag{8}$$

Here we intend $F_j = |j\rangle\langle d-1| + |d-1\rangle\langle j| + \sum_{k \neq j, d-1} |k\rangle\langle k|$
to be a state-flip operator. The single-qudit operators $V_j$ are
chosen so as to perform $V_j|\psi_j\rangle = \langle\psi_j|\psi_j\rangle^{1/2}|0\rangle$
[8]. Then $V_0$ clears the remaining nonzero amplitudes.
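
To make the partition idea concrete, here is a minimal NumPy sketch of a two-qudit state synthesis built from block rotations. It is an illustration under stated assumptions rather than the exact factorization of Eq. (8): each $V_j$ is realized as a Householder-type unitary sending $|\psi_j\rangle$ to $\langle\psi_j|\psi_j\rangle^{1/2}|0\rangle$, the rotations are applied as one block-diagonal operator $\sum_j |j\rangle\langle j| \otimes V_j$, and a final local unitary on A (an addition for this example) clears the remaining amplitudes. The function names are hypothetical.

```python
import numpy as np

def householder_to_e0(v):
    """Return a unitary V with V @ v = ||v|| e_0 (identity if v is zero)."""
    v = np.asarray(v, dtype=complex)
    n = v.size
    norm = np.linalg.norm(v)
    if norm < 1e-14:
        return np.eye(n, dtype=complex)
    # Strip the phase of v[0] so that the reflection target is real and positive
    phase = v[0] / abs(v[0]) if abs(v[0]) > 1e-14 else 1.0
    w = v / phase
    u = w.copy()
    u[0] -= norm                       # u = w - ||w|| e_0
    if np.linalg.norm(u) < 1e-14:      # w already equals ||w|| e_0
        R = np.eye(n, dtype=complex)
    else:
        u /= np.linalg.norm(u)
        R = np.eye(n) - 2.0 * np.outer(u, u.conj())
    return R / phase

def controlled_state_synthesis(psi, d):
    """Return a unitary W on A (x) B with W @ psi = |0>|0>."""
    blocks = psi.reshape(d, d)         # row j holds the unnormalized |psi_j>
    W_ctrl = np.zeros((d * d, d * d), dtype=complex)
    for j in range(d):
        # Rotation on B controlled on level |j> of A
        W_ctrl[j * d:(j + 1) * d, j * d:(j + 1) * d] = householder_to_e0(blocks[j])
    # After the controlled rotations the state is sum_j ||psi_j|| |j>|0>
    norms = np.linalg.norm(blocks, axis=1)
    U_A = householder_to_e0(norms)     # local clean-up on A (assumption of this sketch)
    return np.kron(U_A, np.eye(d)) @ W_ctrl

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d = 4
    psi = rng.normal(size=d * d) + 1j * rng.normal(size=d * d)
    psi /= np.linalg.norm(psi)
    W = controlled_state_synthesis(psi, d)
    print(np.round(np.abs(W @ psi), 6))
```

Applying `W.conj().T` to $|0\rangle|0\rangle$ returns $|\psi\rangle$, matching the convention $W^\dagger|0\rangle = |\psi\rangle$.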

The last subsection implicitly describes a non-local
implementation of a controlled (one-qudit state synthesis) operator
$W$, in that it details a scheme for the non-local $\wedge_1(V_{d-1-j})$.
The resulting circuit for $W$ is shown in Fig. 7 and requires
$d-1$ e-bits and $2(d-1)$ c-bits. Remarkably, the protocol can
be parallelized to 7 computational steps. Here by a single step
we mean a set of operations that is no more time consuming
than a controlled one-qudit rotation $\wedge_1(V)$, which itself can
be decomposed into controlled-phase gates and single-qudit
Givens rotations if so needed. The only nonobvious parallel
step is step 4. Note that the operators $V_j$ generally do not
commute. However, just before and just after this step, the usual
teleportation case study shows that the state of the system
lies within the span of those $|k\rangle = |k_0\rangle_A \otimes_{j=1}^{d-1} |k_j\rangle_{B_j} \otimes |k_d\rangle_B$
in which at most a single $k_j$ is one for $1 \le j \le d-1$. Let $P$
denote the projection of Hilbert space onto the span of all $|k\rangle$
as above. If $Q$ denotes the central product of Equation 8, we
have



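$$PQP = \prod_{j=1}^{d-1} e^{-i t_j |1\rangle_{B_j}\,{}_{B_j}\langle 1| \otimes h_j}. \tag{9}$$

Here the map of Hamiltonians $h \mapsto PhP$ has image equal to the
span of all $|1\rangle_{B_j}\,{}_{B_j}\langle 1| \otimes h$. Moreover, for $j_1 \neq j_2$ and any
Hermitian $h_1, h_2$, we have
$[\,|1\rangle_{B_{j_1}}\,{}_{B_{j_1}}\langle 1| \otimes h_1,\ |1\rangle_{B_{j_2}}\,{}_{B_{j_2}}\langle 1| \otimes h_2\,] = 0$
on the image of $P$. Hence we can generate the gates in step 4 in parallel. The
operations in step 5 correspond to measurement of qubits $B_j$
in the Hadamard basis and count as a single parallel operation.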
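
The commutation argument for step 4 can be illustrated numerically for two ancilla qubits $B_1$, $B_2$ and the target qudit B. In this sketch (names and the restriction to two ancillas are assumptions made for brevity), the generators $|1\rangle\langle 1|_{B_1} \otimes h_1$ and $|1\rangle\langle 1|_{B_2} \otimes h_2$ fail to commute on the full space, but their commutator vanishes once projected onto the subspace in which at most one ancilla is in $|1\rangle$.

```python
import numpy as np

def random_hermitian(n, rng):
    a = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (a + a.conj().T) / 2

def projected_commutator(d, seed=0):
    """Return ([G1, G2], P [G1, G2] P) for random Hermitian h1, h2 on B."""
    rng = np.random.default_rng(seed)
    h1, h2 = random_hermitian(d, rng), random_hermitian(d, rng)
    n1 = np.diag([0.0, 1.0])                 # |1><1| on a single qubit
    I2 = np.eye(2)
    # Tensor ordering: B1 (x) B2 (x) B
    G1 = np.kron(np.kron(n1, I2), h1)
    G2 = np.kron(np.kron(I2, n1), h2)
    comm = G1 @ G2 - G2 @ G1
    # P removes the component in which both ancilla qubits are in |1>
    P = np.kron(np.eye(4) - np.kron(n1, n1), np.eye(d))
    return comm, P @ comm @ P

if __name__ == "__main__":
    comm, projected = projected_commutator(3)
    print(np.linalg.norm(comm), np.linalg.norm(projected))
```

The full commutator equals $|1\rangle\langle 1| \otimes |1\rangle\langle 1| \otimes [h_1, h_2]$, which is supported entirely on the component that $P$ removes.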



C. Spectral decomposition bootstrap to nonlocal gates 

This protocol can be extended to implement an arbitrary
non-local unitary $U \in U(d^2)$ between A and B. Consider the
spectral decomposition Eq. (4) of $U$, which involves multiple
applications of state-synthesis operators $W$ and controlled
phase operators $C$. The controlled phase operators are locally
equivalent to the operator $\wedge_1[\mathbf{1}_d + (e^{i\phi} - 1)|d-1\rangle\langle d-1|]$ and
thus can be implemented in one step using one e-bit and two
c-bits. Thus, any two-qudit unitary can then be built using
$L = 7 \times 2d^2 + d^2 = 15d^2$ parallel operations with the
assistance of $\#e = 2 \times (d-1) \times d^2 + d^2 = 2d^3 - d^2$ e-bits and $2\#e$
c-bits.

FIG. 7: A non-local two qudit gate $U = W^\dagger$ that realizes the
state-synthesis $U|0\rangle_{A,B} = |\psi\rangle$ on qudits A and B using $d-1$ ancillary
qubit pairs (indicated by sawtooth lines) each prepared in the state
$|\Phi^+\rangle = 1/\sqrt{2}(|00\rangle + |11\rangle)_{A_j,B_j}$. Each qubit $A_j$ ($B_j$) in the
entangled resource can constitute a new particle or a distinct degree
of freedom of qudit A (B). Controlled-not gates between A and $A_j$
are conditioned on the basis state $|j\rangle_A$, as indicated by the shading
of the control bubble. The notations are: double lines for
classical controlled operations dependent on qubit measurement outcomes,
$H = e^{i\pi(\sigma^x + \sigma^z)/2\sqrt{2}}$, and $P_j = e^{i\pi|j\rangle\langle j|}$. The sequence of steps that can
be implemented in parallel is indicated at the bottom.
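
The resource counts quoted above are easy to tabulate. The short Python sketch below evaluates the stated formulas for a given qudit dimension; the function name and dictionary keys are hypothetical.

```python
def nonlocal_gate_resources(d):
    """Resource counts for an arbitrary non-local two-qudit unitary, per the text."""
    # 2*d^2 state-synthesis operators at 7 steps each, plus d^2 controlled phases
    steps = 7 * 2 * d**2 + d**2
    # (d - 1) e-bits per state synthesis, one e-bit per controlled phase
    ebits = 2 * (d - 1) * d**2 + d**2
    cbits = 2 * ebits
    return {"parallel_steps": steps, "ebits": ebits, "cbits": cbits}

if __name__ == "__main__":
    for d in (2, 3, 4):
        print(d, nonlocal_gate_resources(d))
```

For d = 3 this gives 135 parallel steps, 45 e-bits, and 90 c-bits, in agreement with $15d^2$ and $2d^3 - d^2$.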

Recently, an alternative construction of two-qudit opera- 
tions using qubit entanglement resources was proposed [22]. 
That work describes how a single e-bit and two c-bits suffice 
to implement a one parameter subgroup of $U(d^2)$ between two
distant qudits A and B with probability one. Specifically, their
protocol realizes unitaries of the form $V(\phi) = \exp[i\phi\, U_A \otimes U_B]$
where the operators $U_A, U_B$ are unitary and Hermitian.
However, the authors do not provide an algorithm for generating an
arbitrary two-qudit unitary nor do they estimate the number of
e-bits consumed in a covering of $U(d^2)$ with such unitaries.
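
Because $U_A \otimes U_B$ is both unitary and Hermitian, it squares to the identity, so $V(\phi) = \cos\phi\, \mathbf{1} + i \sin\phi\, U_A \otimes U_B$. The NumPy sketch below checks this closed form against a matrix exponential computed by eigendecomposition. The particular choices of $U_A$ (a level swap) and $U_B$ (a diagonal sign matrix) are assumptions made for illustration and are not taken from [22].

```python
import numpy as np

def v_phi(phi, UA, UB):
    # Closed form valid when (UA (x) UB)^2 = 1
    K = np.kron(UA, UB)
    return np.cos(phi) * np.eye(K.shape[0]) + 1j * np.sin(phi) * K

def expm_hermitian(phi, K):
    # exp(i*phi*K) for Hermitian K via its eigendecomposition
    w, v = np.linalg.eigh(K)
    return (v * np.exp(1j * phi * w)) @ v.conj().T

if __name__ == "__main__":
    UA = np.eye(3)[[2, 1, 0]]            # swaps levels 0 and 2: Hermitian and unitary
    UB = np.diag([1.0, -1.0, 1.0])       # sign flip on level 1: Hermitian and unitary
    phi = 0.37
    print(np.allclose(v_phi(phi, UA, UB), expm_hermitian(phi, np.kron(UA, UB))))
```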



D. Improved fidelity by purification 

Our protocol requires local high fidelity operations between 
qudit A and a set of qubits {Aj} (similarly between B and 
{Bj}) as well as high fidelity local unitaries. In principle, the 
entangling operations might be made error tolerant. Rather 
than use ancillary qubits that are distinct particles, we might
use composite particles endowed with an inherent tensor product
structure $\mathcal{H} = \mathcal{H}_{\rm qudit} \otimes \mathcal{H}_{\rm ancilla}$ where one subsystem is used
to encode the qudit and the ancillary subsystem is used to
assist in two-qudit gate performance. Dür and Briegel [23]
showed that one can perform extremely high-fidelity two-qubit
gates with this partitioning. In their protocol, information is
encoded in one two-dimensional degree of freedom of each
particle, say spin. Entanglement between particles is
generated using ancillary degrees of freedom such as quantized
states of motion along x, y, or z. The prepared entanglement
may not be perfect. Yet by using nested entanglement purifi- 
cation with two or more degrees of freedom, one can prepare 
a highly entangled state in the ancillary degrees of freedom 
with nonzero probability. If a purification round fails, then the 
entangled state can be reprepared without disturbing the quan- 
tum information encoded in the other degree of freedom (here 
spin). Given this, a non-local CNOT gate can be implemented 
between the encoded qubits. 

Their protocol is readily extended to non-local gates be- 
tween qudits using ancillary qubit degrees of freedom as dis- 
cussed above. The critical assumption for robustness is that 
gates which couple different degrees of freedom of the same 
particle can be performed with much higher fidelity than gates 
which couple different particles. The assumption is frequently 
valid because coupling two spatially distinct particles usually 
involves interactions mediated by a field which can also cou- 
ple to the environment and thus decohere the system. In con- 
trast, gates between different degrees of freedom of the same 
particle, such as coupling spin to motion in trapped ions [24] 
or atoms [""^] can often be implemented with high precision 
using coherent control. 

IV. CONCLUSIONS 

Quantum computation with qudits requires more control at 
the single particle level than with qubits. It might be expected 
that the additional time needed to control all the levels would 
be prohibitively long in terms of memory decoherence times. 



We have shown how parallel (time-step optimized) one-qudit 
and two-qudit computation help surmount such difficulties. 
Given a qudit with a connected coupling graph, the time com- 
plexity for constructing an arbitrary unitary can be reduced at 
the expense of additional control resources. Even for systems 
with little connectivity between states, such as in the case of 
a qudit encoded in hyperfine levels of an atomic alkali, the 
number of parallel elementary gates can be made close to the 
optimal count for a maximally connected state space. For the 
purposes of two-qudit gates, we found a non-local
implementation of an arbitrary unitary using $O(d^2)$ parallel steps. The
protocol uses $O(d^3)$ e-bits which could in principle be
prepared and distributed ahead of time with high fidelity.

Some outstanding issues remain. First, our treatment fo- 
cused on systems with allowed couplings between pairs of 
states. In other systems, the selection rules may dictate a
different set of subalgebras to be used for quantum control, e.g.
spin-$j$ representations of the algebra su(2). Some particular
computations may be realized with much greater efficiency
using such generators. Second, fault tolerant computation
relies not on exactly universal computation, but rather on
approximating unitaries using a discrete set of one and two-qudit
gates. It would be worthwhile to investigate optimal protocols 
for implementing a discrete set of fault tolerant non-local two 
qudit gates using entangled qubit pairs. 